
feat(server): add serve log-level flag#57

Merged
raullenchai merged 1 commit into raullenchai:main from XiaoPengMei:add-log-level-flag-50
Apr 11, 2026

Conversation

@XiaoPengMei

Closes #50

Summary

  • add --log-level to both rapid-mlx serve and python -m vllm_mlx.server
  • apply the selected level to Python logging and pass the normalized value through to uvicorn.run
  • add focused tests that verify both entrypoints expose the new flag

Testing

  • pytest tests/test_harmony_parsers.py -k log_level
  • manual QA: invoked serve_command() with --log-level WARNING using a stubbed server and verified uvicorn.run(..., log_level='warning') plus root logger level 30

raullenchai pushed a commit that referenced this pull request Mar 26, 2026
…waybarrios#180)

* feat: MLLM+MTP per-request routing for text and vision

When both --mllm and --enable-mtp are set, SimpleEngine builds a
parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy).
Text-only requests route to mlx_lm with MTP speculative decoding;
media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

* feat: system prompt KV caching for SimpleEngine MTP text path

Persist backbone KV cache after prefilling system prompt tokens.
On subsequent requests with the same system prompt, restore the
snapshot and only prefill the suffix (user + history) tokens.

For a 10K-token system prompt on the 122B model, this saves ~57s
per request by avoiding redundant system prompt prefill.

Implementation:
- Detect system prefix via ChatML boundary markers
- Hash prefix text for cache key validation
- On cache miss: prefill system tokens, snapshot backbone KV state
- On cache hit: restore snapshot into fresh cache, send suffix only
- Token prefix validation ensures correct split at tokenization boundary
- Single-entry cache (one system prompt at a time)
- Stats exposed via get_stats() → system_kv_cache
- Cache cleared on stop(), invalidated on system prompt change
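The hit/miss flow above can be sketched as a single-entry cache keyed by a hash of the system prefix (a rough illustration only — the class name and the opaque `snapshot` object are assumptions, not the engine's real types):

```python
import hashlib

class SystemPromptKVCache:
    """Single-entry cache: KV snapshot taken after prefilling the system
    prompt. Illustrative sketch; not the engine's actual implementation."""

    def __init__(self):
        self._key = None
        self._snapshot = None
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _hash(prefix_text: str) -> str:
        return hashlib.sha256(prefix_text.encode()).hexdigest()

    def lookup(self, prefix_text: str):
        """Return the stored snapshot on a hit, else None (a miss)."""
        if self._key == self._hash(prefix_text):
            self.hits += 1
            return self._snapshot
        self.misses += 1
        return None

    def store(self, prefix_text: str, snapshot):
        # Storing a new system prompt implicitly invalidates the old entry.
        self._key = self._hash(prefix_text)
        self._snapshot = snapshot

    def clear(self):  # called from stop()
        self._key = self._snapshot = None
```

On a miss the engine prefills the system tokens and calls store(); on a hit it restores the snapshot into a fresh cache and prefills only the suffix tokens.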

* feat: SpecPrefill — attention-based sparse prefill for TTFT reduction

Uses a small draft model to identify important prompt tokens via attention
scoring, then sparse-prefills the target model with only those tokens while
preserving original positional encoding via manual RoPE. Reduces TTFT
2.8-3.1x on 122B and 1.8x on 35B at 20% keep rate.

Implementation:
- specprefill.py: Core module with score_tokens(), select_chunks(),
  sparse_prefill(), cleanup_rope() (~640 lines)
- SimpleEngine integration: draft model loading, threshold-based activation,
  composition with system prompt KV cache, graceful fallback on error
- Per-request API: specprefill (bool) + specprefill_keep_pct (float)
  via extra_body for per-request control
- CLI: --specprefill, --specprefill-threshold, --specprefill-keep-pct,
  --specprefill-draft-model, --prefill-step-size

Closes waybarrios#179. Related: waybarrios#178 (TTFT), #57 (speculative decoding).
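The chunk-selection step can be sketched as follows (a rough illustration; the real score_tokens()/select_chunks() in specprefill.py operate on draft-model attention maps and may differ in signature):

```python
def select_chunks(importance, keep_pct, chunk=32):
    """Rank fixed-size chunks of the prompt by mean importance score and keep
    the top keep_pct fraction, returning the ORIGINAL token positions so the
    target model can be sparse-prefilled with unchanged RoPE positions.
    Sketch only, not the module's actual code."""
    n = len(importance)
    n_chunks = (n + chunk - 1) // chunk
    scores = [
        sum(importance[i * chunk:(i + 1) * chunk])
        / len(importance[i * chunk:(i + 1) * chunk])
        for i in range(n_chunks)
    ]
    keep = max(1, round(n_chunks * keep_pct))
    # Keep the highest-scoring chunks, re-sorted so positions stay monotonic.
    chosen = sorted(sorted(range(n_chunks), key=scores.__getitem__)[-keep:])
    return [p for i in chosen for p in range(i * chunk, min((i + 1) * chunk, n))]
```

Because the returned indices are original prompt positions, the manual RoPE step can encode each kept token at its true position rather than its compacted index.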

* feat: multi-architecture support for SpecPrefill scoring and sparse prefill

Add support for three model architecture families with auto-detection:

- Qwen3.5: gate split + q_norm + RoPE (existing, now refactored)
- Nemotron-H: content-based attention (no RoPE), mixer attr, compacted cache
- GPT-OSS/Llama: standard q_proj + RoPE (GQA, YarnRoPE compatible)

Key changes:
- Architecture-specific query extractors (_qwen35, _llama, _nemotron_h)
- Auto-detection in score_tokens() via model attributes (q_norm/rope/mixer)
- _get_attn_module()/_set_attn_module() abstract self_attn vs mixer access
- _find_attention_layers() handles block_type="*" (Nemotron-H attention)
- _build_layer_to_cache_map() handles compacted cache indexing
- sparse_prefill() skips RoPE patching for architectures without it
- cleanup_rope() is no-op for RoPE-less architectures
- Remove score_tokens_self() stub (CritiPrefill not viable for MoE)

Tested on Qwen3.5 4B (positions + pipeline). Nemotron-H and GPT-OSS
code paths ready for empirical validation.

* fix: handle GPT-OSS sliding window caches and head attribute naming

Two bugs found during cross-architecture testing on GPT-OSS 120B:

1. _llama_extract_queries() used eager evaluation in getattr fallback
   chain: getattr(attn, "num_attention_heads", attn.num_heads) evaluates
   attn.num_heads before checking if num_attention_heads exists. Fixed to
   use safe nested getattr with None default.

2. _compute_importance() concatenated score matrices with different
   shapes when mixing sliding window (128-token RotatingKVCache) and
   full attention (unlimited KVCache) layers. Fixed by skipping layers
   whose cache spans fewer tokens than the full prompt.
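The fix in (1) amounts to evaluating each fallback lazily instead of eagerly (attribute names from the commit; the helper name here is illustrative):

```python
class LegacyAttn:
    num_heads = 8  # only the legacy attribute name exists

# Buggy pattern: the default argument of getattr is evaluated eagerly, so
#   getattr(attn, "num_attention_heads", attn.num_heads)
# raises AttributeError whenever num_heads is absent, even if
# num_attention_heads is present.

def head_count(attn) -> int:
    """Try each attribute name in turn; nothing is evaluated eagerly."""
    for name in ("num_attention_heads", "num_heads"):
        value = getattr(attn, name, None)
        if value is not None:
            return value
    raise AttributeError("no head-count attribute found")

print(head_count(LegacyAttn()))  # 8
```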

Validated on GPT-OSS 120B + 20B draft: importance-based selection
produces coherent output while uniform selection degrades, confirming
scoring signal from 18 full-attention layers is sufficient.

* fix: preserve tail tokens for models with RotatingKVCache

Models with sliding window attention (e.g., GPT-OSS alternating
sliding/full layers) use RotatingKVCache that evicts old entries.
When sparse prefill inserts more tokens than the window size, the
cache loses context needed for decode.

sparse_prefill() now auto-detects RotatingKVCache and augments the
selection to include the last max_size positions, ensuring sliding
window layers have valid recent context.

Validated: GPT-OSS 120B + 20B draft produces coherent output on
2294-token prompts (was garbage before this fix). Qwen3.5 and
Nemotron-H unaffected (no RotatingKVCache in their cache).
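The augmentation step is roughly the following (sketch; `window` corresponds to the RotatingKVCache max_size):

```python
def augment_with_tail(positions, prompt_len: int, window: int):
    """Ensure the last `window` prompt positions are always in the sparse
    selection, so sliding-window (rotating) cache layers retain valid recent
    context for decode. Illustrative sketch of the idea."""
    tail = range(max(0, prompt_len - window), prompt_len)
    return sorted(set(positions) | set(tail))
```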

* feat: SpecPrefill support for non-MTP models (standard LLM path)

Add _stream_generate_specprefill() method for models that don't use MTP
speculative decoding (Nemotron, GPT-OSS, etc). The existing SpecPrefill
integration only worked in the MTP text path (_stream_generate_text).

Changes:
- stream_generate() now pops specprefill/specprefill_keep_pct from kwargs
  and dispatches to the new method when conditions are met
- _stream_generate_specprefill() follows the same pattern as the MTP path:
  score → select → sparse_prefill → autoregressive generation
- Graceful fallback to normal generation on any error
- Per-request overrides (specprefill, specprefill_keep_pct) via extra_body
- Threshold and upper-bound checks identical to MTP path
Owner

@raullenchai left a comment


Thanks for the PR, @XiaoPengMei! The feature itself is straightforward and welcome. A few things to address before merge:

Issues

P1: Case-insensitive input rejected

choices=["DEBUG", "INFO", "WARNING", "ERROR"] means --log-level debug (lowercase) is rejected by argparse. Most CLI tools accept either case. Fix: add type=str.upper to the argument so any casing is accepted:

serve_parser.add_argument(
    "--log-level",
    type=str.upper,
    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    default="INFO",
    ...
)

This also makes normalize_log_level() unnecessary — argparse handles it.

P2: Root logger modification is too broad

logging.getLogger().setLevel(...) changes the root logger, which affects every library (httpx, uvicorn internals, asyncio, etc.). At DEBUG this floods output with noise unrelated to vllm-mlx.

Instead, only set the vllm_mlx logger hierarchy:

def configure_logging(log_level: str) -> str:
    level = getattr(logging, log_level, logging.INFO)
    logging.getLogger("vllm_mlx").setLevel(level)
    return log_level.lower()  # uvicorn wants lowercase

P3: Tests are source-code grep, not behavioral

The tests inspect source code strings ('"--log-level"' in source). This passes even if the flag is broken at runtime. A better approach would be to actually parse args through argparse and verify the result. Also, these tests belong in a CLI/server test file, not test_harmony_parsers.py.
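A behavioral version could look like this (a sketch — `make_serve_parser()` is a hypothetical stand-in for however the CLI actually builds its parser; the real test should import that builder):

```python
import argparse

def make_serve_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the project's real parser construction.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--log-level",
        type=str.upper,  # argparse applies type before validating choices
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",
    )
    return parser

def test_log_level_defaults_to_info():
    assert make_serve_parser().parse_args([]).log_level == "INFO"

def test_log_level_accepts_any_case():
    args = make_serve_parser().parse_args(["--log-level", "warning"])
    assert args.log_level == "WARNING"

def test_log_level_rejects_unknown_value():
    try:
        make_serve_parser().parse_args(["--log-level", "TRACE"])
    except SystemExit:
        pass  # argparse exits on an invalid choice
    else:
        raise AssertionError("expected argparse to exit on invalid choice")
```

Tests like these exercise the actual argparse wiring, so they fail if the flag is renamed, the choices change, or the `type=str.upper` normalization is dropped.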

Minor

  • normalize_log_level() is just .upper() — can be removed if you use type=str.upper in argparse.
  • Missing blank line between configure_logging() and # Global engine instance comment.

raullenchai added a commit that referenced this pull request Apr 6, 2026
* fix: Use streaming detokenizer for UTF-8-safe incremental decode

Replace per-token tokenizer.decode([token]) with a streaming
detokenizer that buffers partial UTF-8 byte sequences. This fixes
corrupted multi-byte characters (e.g. Czech 'ď' → '��') during
SSE streaming, caused by byte-level tokens being decoded individually
instead of accumulated until a complete UTF-8 character boundary.

Uses mlx_lm's NaiveStreamingDetokenizer (or the optimized
BPEStreamingDetokenizer when available via tokenizer.detokenizer)
with a per-request pool that is cleaned up on request completion.

Both LLM scheduler and MLLM scheduler are fixed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
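The underlying failure mode is easy to reproduce: a multi-byte character split across byte-level tokens cannot be decoded per-token. A minimal byte-buffering sketch of the idea (the actual fix uses mlx_lm's NaiveStreamingDetokenizer / BPEStreamingDetokenizer, not this class):

```python
class Utf8StreamBuffer:
    """Accumulate raw bytes and emit text only up to the last complete UTF-8
    boundary. Illustration only; a real implementation would also bound the
    buffer against genuinely invalid input."""

    def __init__(self):
        self._buf = b""

    def push(self, raw: bytes) -> str:
        self._buf += raw
        try:
            text = self._buf.decode("utf-8")
            self._buf = b""
            return text
        except UnicodeDecodeError as err:
            # Emit the complete prefix, keep the partial sequence buffered.
            text = self._buf[: err.start].decode("utf-8")
            self._buf = self._buf[err.start:]
            return text

buf = Utf8StreamBuffer()
print(buf.push("\u010f".encode()[:1]))  # first byte of 'ď': emits nothing yet
print(buf.push("\u010f".encode()[1:]))  # second byte completes 'ď'
```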

* Add --served-model-name CLI parameter

Allow users to serve a model under a different name in API responses,
matching vLLM's --served-model-name behavior.

* Fix prefix cache dir using served name instead of model path

The cache directory was derived from _model_name which could be
overridden by --served-model-name, causing cache misses when the
served name changed. Use the actual model path instead.

* Add Qwen3.5 model support with text-only loading and fix reasoning+tool streaming

- Add strict=False fallback in tokenizer loader for models with extra
  weights (e.g., vision tower params), enabling Qwen3.5 to load via
  mlx-lm as a text-only model
- Fix streaming tool call parsing when both --reasoning-parser and
  --tool-call-parser are enabled (previously mutually exclusive branches)
- Make memory pressure threshold dynamic based on system RAM instead
  of hardcoded 200GB

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: check trim method existence before calling

Fixes AttributeError when ArraysCache.is_trimmable() returns True
but the trim() method doesn't exist. Added hasattr check for trim
before calling it in scheduler.py lines 772 and 802.

Closes waybarrios#145

* fix(batched): add exclude_none=True to model_dump in image extraction

* fix: filter None values from dict() fallback and api/utils.py serialization

* fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models

Qwen3.5 uses a hybrid architecture (Attention + Mamba/SSM layers), where
`model.make_cache()` returns a mix of `KVCache` and `ArraysCache` objects.

`ArraysCache.__init__()` requires a `size` parameter, but `BatchMambaCache`
conditionally skipped it when `HAS_MAMBA_CACHE=True`. Since `MambaCache`
was removed in mlx-lm >= 0.30.6 and falls back to `ArraysCache`, the
`HAS_MAMBA_CACHE` flag is unreliable.

This caused `--continuous-batching` mode to crash in an infinite error loop:
  `ArraysCache.__init__() missing 1 required positional argument: 'size'`

The fix unconditionally passes `size` to `super().__init__()`, which is
safe for both `ArraysCache` (requires it) and legacy `MambaCache`
(accepts it).

Without this fix, continuous batching and prefix caching are completely
broken for Qwen3.5 models on Apple Silicon.

Related upstream issues:
- ml-explore/mlx-lm#980 (prefix cache fails for hybrid models)
- QwenLM/Qwen3.6#37 (ArraysCache vs KVCache in hybrid arch)

* fix: compatibility with mlx-lm 0.31.x (prompt_checkpoints tuple)

mlx-lm 0.31.0 added prompt_checkpoints support, changing the
BatchGenerator.insert() tuple from 6 elements to 7. This causes
"ValueError: too many values to unpack (expected 6)" in
_chunked_next when processing any request.

Changes:
- scheduler.py line ~395: unpack 7 values (add _prompt_checkpoints)
- scheduler.py line ~406: pass max_kv_size=None to _make_cache()
  (signature changed in mlx-lm 0.31.0 to require 3 args)

Tested on Mac Mini M4 Pro 64GB with:
- mlx-lm 0.31.0
- mlx 0.31.1
- Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit
- vllm-mlx 0.2.5 (this fork)

Fixes the same issue as jundot/omlx#110.

* fix(mllm_scheduler): add adaptive periodic cache clearing (waybarrios#157)

* fix: rename platform.py to vllm_platform.py to avoid stdlib shadowing

* fix: handle video_url content type and fix video frame token counting

Three bugs fixed:

1. video_url content type silently ignored in MLLM chat() and stream_chat().
   The OpenAI API video format uses {"type": "video_url", "video_url": {"url": ...}}
   but only "video" type was handled. Fixes #120.

2. Video frames extracted AFTER chat template built, causing token count
   mismatch (template has 0 image tokens but vision encoder produces N*frame
   features). Restructured to two-pass approach: extract video frames first,
   then build chat template with correct frame counts.

3. server.py has_media always False in MLLM mode because images/videos are
   extracted from messages internally (set to []). Added MLLM-specific check
   so video_fps/video_max_frames params still reach chat() via chat_kwargs.

* feat: native Qwen3-VL video pipeline with temporal 3D conv + M-RoPE

For models with video_token_id (Qwen-family), video inputs now flow through
mlx-vlm's native video pipeline instead of being treated as individual images.

This activates:
- 3D conv frame pairing (temporal_patch_size=2)
- M-RoPE temporal position IDs (interleaved layout)
- Timestamp-frame interleaving in the prompt
- Proper video_grid_thw for the vision encoder

Falls back to frame-as-images for non-video models.

Adds _generate_native_video() and _translate_messages_for_native_video()
to MLXMultimodalLM, plus unit tests for video URL parsing, frame count
alignment, and message translation.

* style: ruff format + lint fixes for new code

* Fix video native init, import guard, empty source and has_media detection

* feat: SpecPrefill — attention-based sparse prefill for TTFT reduction (waybarrios#180)

* remove streaming tool fix (covered by waybarrios#148) and fix eos_token_ids in strict=False loader

* fix: address PR waybarrios#150 review — tool forwarding, kwargs safety, video_generate wiring

- Forward tools to apply_chat_template in native video path (fixes
  silent tool-call drop, regression from PR #124)
- Pop tools, use_cache, video_fps, video_max_frames from kwargs
  before native video branch in chat() and stream_chat() to prevent
  leaking into mlx_vlm.generate()
- Extract _collect_video_inputs() to deduplicate video extraction
  between chat() and stream_chat()
- Split _generate_native_video into _prepare_native_video_inputs
  (preprocessing) + _generate_native_video (generation) wired
  through mlx_vlm.video_generate for clearer intent and easier
  adoption of upstream improvements
- Add ImportError guard on video_generate import in
  _generate_native_video to match codebase convention
- Document blocking stream_chat native video path — no upstream
  streaming API, engine wraps in asyncio.to_thread()
- Add tests for multi-message videos, multiple videos per message,
  video_url translation, Pydantic handling, tool forwarding,
  video_generate import verification

* fix lint CI to use python 3.13 for black compatibility

* format engine_core.py long line

* fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection

- ensure_mamba_support() now no-op: mlx-lm >= 0.30.6 ArraysCache has
  native batch support; old patch broke hybrid models (ArraysCache + KVCache)
- Add inject_mtp_support(): dynamically create MTP module, load weights,
  and monkey-patch model class with return_hidden/mtp_forward/make_mtp_cache
- Add _try_inject_mtp_post_load: auto-detect and inject MTP weights
  stripped by sanitize() during mlx_lm.load()
- Add strict=False fallback for models with extra MTP parameters
- validate_mtp_support: support model.language_model.args hierarchy
- Improve engine loop error logging with full traceback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* format test_video.py

* remove dead code in _load_strict_false

* Don’t truncate base64 images before hashing.

Truncating the string before hashing makes distinct base64 JPEGs that share a prefix collide to the same hash, so vllm-mlx reuses one cached image for all of them and produces duplicated, incorrect responses.
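The fix is simply to hash the full payload; a sketch of the intended keying (`image_cache_key` is illustrative, not the project's actual helper):

```python
import hashlib

def image_cache_key(b64_data: str) -> str:
    # Hash the FULL base64 payload. Truncating first (e.g. b64_data[:1024])
    # collapses distinct images that share a prefix onto one cache entry.
    return hashlib.sha256(b64_data.encode("ascii")).hexdigest()

a = "A" * 2000 + "X"
b = "A" * 2000 + "Y"
assert image_cache_key(a) != image_cache_key(b)  # full hash distinguishes them
assert a[:1024] == b[:1024]                      # a truncated hash would not
```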

* fix: bump mlx-lm minimum to 0.31.0 for hybrid model batching

ArraysCache gained native batching support (extract, merge, filter,
prepare) in mlx-lm 0.31.0. Older versions crash with
"ArraysCache.__init__() missing 1 required positional argument: 'size'"
when continuous batching encounters hybrid models like Qwen3.5 that
mix KVCache and ArraysCache layers.

Fixes computor-org#11

* fix: alias validation, Hub model MTP routing, non-streaming text path, use_cache double-pop

P1: _validate_model_name() now accepts _model_alias and _model_path
    so alias-based requests don't 404 before legacy check runs.

P1: build_text_model() resolves Hub repo IDs via snapshot_download
    (no-op if cached) so MLLM+MTP routing works for non-local models.

P2: Non-streaming chat() now routes text-only requests through
    _stream_generate_text() matching stream_chat() behavior.

P2: Remove duplicate kwargs.pop("use_cache", True) in mllm.py that
    overwrote the caller's value after the first pop consumed the key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: non-streaming text-only MTP deadlock and accumulation bug

P1: Move text-only MTP routing before _generation_lock acquisition.
_stream_generate_text() acquires the lock internally, so calling it
inside the lock caused an asyncio.Lock deadlock (not re-entrant).

P2: Use last chunk's accumulated text instead of concatenating deltas.
_stream_generate_text() yields full accumulated text on each chunk,
not deltas. Also use chunk.completion_tokens directly instead of
len(tokens) which was always 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: forward stop sequences to text-only MTP generation path

_stream_generate_text() now accepts a stop parameter and checks
accumulated text against stop sequences in the yield loop, matching
the behavior of MLXLanguageModel.stream_generate(). Both callers
(chat() and stream_chat()) now pass stop through.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: truncate new_text on stop hit so SSE streams omit stop sequence

When a stop sequence is found in accumulated_text, also trim new_text
by the same overshoot so streaming clients never see the stop string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
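The overshoot trimming can be sketched as follows (illustrative only, not the engine's exact code):

```python
def apply_stop(accumulated: str, new_text: str, stop_sequences):
    """Trim the current delta when a stop sequence lands in the accumulated
    text, so the stream never emits any part of the stop string."""
    for stop in stop_sequences:
        idx = accumulated.find(stop)
        if idx != -1:
            overshoot = len(accumulated) - idx  # chars at and after the stop
            return new_text[: max(0, len(new_text) - overshoot)], True
    return new_text, False
```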

* fix: use self.max_kv_size instead of None in _make_cache call

Respects user-configured KV cache limits on the chunked prefill path
instead of silently defaulting to None.

Credit: @dougborg for catching this.

* fix: report prompt_tokens correctly for LLM models in SimpleEngine

LLM.stream_generate() never set prompt_tokens on StreamingOutput, so
the API always reported 0 prompt tokens for text-only models (including
MiniMax-M2.5). The MLLM+MTP path worked because it tokenizes the prompt
for KV caching, but the standard LLM path never counted.

Changes:
- Add prompt_tokens field to StreamingOutput dataclass
- Count prompt tokens in LLM.stream_generate() via tokenizer.encode()
- Add fallback in SimpleEngine.stream_generate() for normal finish path
- Count prompt tokens in SimpleEngine.chat() non-streaming LLM path

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* format scheduler.py trim checks from PR waybarrios#152

* cleanup: remove redundant fallback tokenization and defensive hasattr checks

* bump version to 0.2.7

* format scheduler.py _make_cache call from PR waybarrios#183

* remove unused HAS_MAMBA_CACHE flag

* fix: clean up detokenizer pool in abort, reset, and error recovery paths

* fix: skip stop tokens in mllm_scheduler detokenizer to match scheduler.py

* fix: suppress tool call XML from streaming text content (#129)

Tool call XML (e.g. <minimax:tool_call>, <tool_call>) was leaking into
streaming text deltas via the /v1/messages endpoint. The raw markup
appeared in the client's conversation context alongside the structured
tool_use block, doubling token consumption for every tool call.

Add StreamingToolCallFilter that buffers streaming text and suppresses
content inside tool call blocks. Handles tags split across multiple
deltas, multiple tool calls per response, and preserves <think> blocks.

Supports MiniMax (<minimax:tool_call>) and Qwen (<tool_call>) formats.

14 unit tests included.

Fixes #129
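A toy version of the buffering approach (sketch only — the real StreamingToolCallFilter supports the MiniMax and Qwen tag formats and, per later commits, caps the buffer at 1MB):

```python
class ToolCallFilter:
    """Suppress everything between <tool_call> and </tool_call>, holding back
    buffer tails that could be a tag split across deltas. Illustration only."""

    OPEN, CLOSE = "<tool_call>", "</tool_call>"

    def __init__(self):
        self._in_call = False
        self._buf = ""

    def _partial(self, tag: str) -> int:
        # Length of the longest buffer tail that is a prefix of tag.
        for k in range(min(len(tag) - 1, len(self._buf)), 0, -1):
            if tag.startswith(self._buf[-k:]):
                return k
        return 0

    def feed(self, delta: str) -> str:
        self._buf += delta
        out = ""
        while True:
            if not self._in_call:
                i = self._buf.find(self.OPEN)
                if i != -1:
                    out += self._buf[:i]
                    self._buf = self._buf[i + len(self.OPEN):]
                    self._in_call = True
                    continue
                hold = self._partial(self.OPEN)  # maybe a split open tag
                out += self._buf[: len(self._buf) - hold]
                self._buf = self._buf[len(self._buf) - hold:]
                return out
            i = self._buf.find(self.CLOSE)
            if i == -1:  # still inside a call: suppress, keep partial close
                self._buf = self._buf[len(self._buf) - self._partial(self.CLOSE):]
                return out
            self._buf = self._buf[i + len(self.CLOSE):]
            self._in_call = False
```

Feeding "Hi <tool" then "_call>{…}</tool_call> done" emits only the text outside the block, even though the tags are split across deltas.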

* fix: also filter Qwen3 bracket-style tool calls from streaming

Add [Calling tool: ...)] to the streaming filter tag list.
MiniMax-M2.5 uses this format for some tool calls alongside its
native XML format.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* fix: filter all tool call format variants from streaming

MiniMax generates multiple tool call formats:
- <minimax:tool_call> XML (native)
- <tool_call> (Qwen)
- [Calling tool: ...] and [Calling tool=...] (bracket variants)
- [TOOL_CALL]...[/TOOL_CALL] (block format)

Consolidate bracket variants under single [Calling tool prefix with
newline as delimiter. Add [TOOL_CALL] block format.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* fix: add Llama function format to streaming filter

Add <function=name>...</function> (Llama-style) to filtered tags.
Now covers all formats supported by parse_tool_calls():
- MiniMax XML, Qwen XML, Qwen3 bracket, Llama function,
  Nemotron (via <tool_call>), and [TOOL_CALL] block.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* feat: route <think> blocks to Anthropic thinking content blocks

Add StreamingThinkRouter that separates thinking from response text.
Models that inject <think> in the generation prompt (MiniMax, Qwen3,
DeepSeek-R1) are auto-detected from the chat template.

Stream pipeline: raw text → tool call filter → think router → emit

Thinking content emits as Anthropic thinking content blocks
(thinking_delta) so clients render them distinctly from responses.

* chore: remove uv.lock from PR

uv.lock is not tracked upstream - accidentally included.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* fix: track prompt_tokens in Anthropic streaming endpoint

_stream_anthropic_messages() never read prompt_tokens from the engine,
always reporting 0 input_tokens. Now tracks prompt_tokens alongside
completion_tokens and includes input_tokens in message_delta usage.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* address review: add ThinkRouter tests, integration tests, refactor block emission

Addresses PR waybarrios#232 review feedback from Thump604:

1. StreamingThinkRouter unit tests (18 tests):
   - start_in_thinking mode and transition to text
   - Partial tag handling (held back, split across deltas, false alarms)
   - Multiple think blocks, token-by-token streaming
   - Flush behavior and state reset

2. Integration tests (12 tests):
   - Full pipeline: tool_filter → think_router → SSE events
   - Pure text, thinking→text, start_in_thinking→text
   - Tool call suppression with accumulated text preserved
   - Mixed thinking + text + tool calls
   - Block index increment verification

3. Refactored _emit_content_pieces() helper:
   - Extracts block transition logic (was repeated 3x in server.py)
   - Handles block_start/stop/delta emission
   - Returns updated state (block_type, index) for caller

44 tests passing across filter, router, and integration suites.

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* style: apply black formatting to pass CI lint

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* fix: address 3 IMPORTANT items from medical-grade review

1. [Calling tool] close tag changed from "\n" to "]\n" to prevent
   premature close on multi-line JSON args. Added tests for bracket-
   style and multi-line tool calls.

2. Buffer safety cap (1MB) on unclosed tool call blocks with warning
   log when exceeded. Prevents unbounded memory growth from pathological
   input. Added test for cap behavior.

3. accumulated_text now tracks raw delta_text before special token
   cleaning, ensuring tool call parsing is independent of the
   SPECIAL_TOKENS_PATTERN. Matches integration test behavior.

47 tests passing (44 existing + 3 new).

Co-Authored-By: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>

* fix: missing return in load_model_with_fallback success path

The merge accidentally dropped the `return model, tokenizer` after the
successful `load()` call in tokenizer.py. This caused all model loading
to return None and crash with "cannot unpack non-iterable NoneType".

Also update test_api_models owned_by assertion to match our branding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* bench: re-run benchmarks post-merge on 4 models

Verified merge integrity with end-to-end benchmarks:

- Qwen3.5-35B-A3B 8bit: 83.1 tok/s, 100% tools, 0% leak
- MiniMax-M2.5 4bit: 51.7 tok/s, 100% tools, 0% leak
- Qwen3.5-4B 4bit: 161.5 tok/s, 100% tools, 0% leak
- GLM-4.5-Air 4bit: 100% tools (decode anomaly — model stops early)

All results consistent with pre-merge README data.
1968 unit tests + 7 e2e tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* bench: comprehensive 14-model benchmark + agent integration tests

Benchmarked 14 models post-merge on Mac Studio M3 Ultra (256GB):

Qwen family (100% tools):
- Qwen3.5-4B 4bit:     161.5 tok/s, 2.9 GB
- Qwen3.5-9B 4bit:     99.8 tok/s, 5.4 GB
- Qwen3.5-27B 4bit:    39.0 tok/s, 14.8 GB
- Qwen3.5-35B-A3B 8b:  83.1 tok/s, 35.0 GB
- Qwen3-Coder-Next 4b: 74.5 tok/s, 42.4 GB
- MiniMax-M2.5 4bit:   51.7 tok/s, 120.4 GB

Non-Qwen:
- Llama-3.2-3B:    226.5 tok/s (fastest, no tools)
- Hermes-3-8B:     123.4 tok/s (no tools)
- Phi-4-mini:      174.0 tok/s (no tools, 100% leak)
- Gemma-3-12B:     48.4 tok/s (no tools)
- Mistral-Small:   pending (see json)
- Devstral-24B:    29.6 tok/s (no tools)
- GPT-OSS-20B:     58.5 tok/s (no tools)
- GLM-4.5-Air:     ~49 tok/s (100% tools)

Agent integration verified:
- LangChain: basic chat + tool calling ✓
- OpenAI SDK: streaming + tool calling ✓
- Aider-style: code editing + multi-turn ✓
- OpenCode-style: streaming tool calls ✓

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Gemma 4 sanitize monkey-patch for mlx-vlm weight loading

mlx-vlm 0.4.3 has two bugs loading Gemma 4 models:
1. sanitize() doubles the 'language_model.model.' prefix
2. MLX-format models skip sanitize entirely

Our patch intercepts load_model(), fixes the prefix mapping, and
force-reloads weights when scales are detected as all-zero.

Linear layer weights load correctly with this patch. Embedding layer
remains broken due to upstream quantization issue (scales are zero
in the safetensors files themselves — mlx-community model bug).

Upstream: Blaizzy/mlx-vlm#912
TODO: Remove patch once mlx-vlm fixes the bug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* revert: remove Gemma 4 monkey-patch, use mlx-vlm 0.4.3 as-is

mlx-vlm 0.4.3 supports Gemma 4 natively with bf16 weights
(google/gemma-4-31b-it). The quantized model embedding issue
is an upstream mlx-community quantization bug, not mlx-vlm's.

Use bf16 original weights instead of patching around broken quants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: MLLM streaming crash — build_prompt not supported for VLM models

The eager template validation in streaming chat called build_prompt()
which throws RuntimeError for MLLM models. Skip the check when
engine.is_mllm is true.

Fixes Gemma 4 (and all MLLM) streaming 500 errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: graceful handling of MLLM stream_chat cleanup errors

Some VLM models (e.g. Gemma 4) raise concatenate errors during
generator cleanup after generation completes. If we already have
output tokens, log the warning and treat as finished instead of
crashing the response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* deps: bump mlx-vlm minimum to 0.4.4 for Gemma 4 support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Gemma 4 tool call parser

Gemma 4 uses a native tool format: <|tool_call>call:name{k:<|"|>v<|"|>}<tool_call|>
This parser handles both non-streaming and streaming extraction.

Auto-detected for gemma4 model names. Registered as "gemma4" / "gemma_4".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
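
A sketch of non-streaming extraction for the format shown above, assuming the `<|"|>` token plays the role of a JSON double quote; the real grammar and parser may differ:

```python
import json
import re

# <|tool_call>call:name{k:<|"|>v<|"|>}<tool_call|>
_CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)


def parse_gemma4_tool_calls(text: str) -> list[dict]:
    """Extract structured tool calls from Gemma 4's native tool format."""
    calls = []
    for name, body in _CALL_RE.findall(text):
        # Map the <|"|> quote token to a JSON double quote, then parse the args.
        json_body = "{" + body.replace('<|"|>', '"') + "}"
        try:
            args = json.loads(json_body)
        except json.JSONDecodeError:
            args = {}
        calls.append({"name": name, "arguments": args})
    return calls
```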

* fix: add global exception handler + MLLM error logging for prod resilience

- Global FastAPI exception handler catches unhandled errors and returns
  JSON 500 instead of killing the connection. Server stays alive.
- MLLM chat() path now logs full traceback on failure before re-raising.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Gemma 4 text-only LLM path — prompt cache + all optimizations

Load Gemma 4 via mlx-vlm's LanguageModel but route through the LLM
path (not MLLM), enabling prompt cache, KV trim, and all decode
optimizations.

Gemma4TextWrapper adapts LanguageModelOutput → raw logits for mlx-lm
generate_step() compatibility. Cache is fully trimmable (60 KVCache
+ RotatingKVCache layers).

Auto-detected: gemma4 models skip --mllm and go through LLM path.
Requires bf16 weights (quantized models have embedding issues).

TODO: Remove once mlx-lm adds native gemma4 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* bench: Gemma 4 31B — LLM path 5.2x faster than MLLM path

Gemma 4 31B benchmark results (Mac Studio M3 Ultra 256GB):

LLM path (4bit, with prompt cache):
- Decode: 32 tok/s
- TTFT cached: 242ms
- Tool calling: 100%
- RAM: 17 GB

MLLM path (bf16, no cache):
- Decode: 6.1 tok/s
- TTFT: 874ms (no cache)
- RAM: 59 GB

LLM path advantage: 5.2x decode, 3.6x TTFT, 3.5x less RAM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: mixed quantization support for Gemma 4 E4B/E2B models

mlx_vlm.convert produces mixed-quant models (4bit default, 8bit MLP).
Parse per-layer overrides from config and pass as class_predicate to
nn.quantize(). Also fix tool parser super().reset() call.

Tested: 31B 4bit (uniform), E4B 4bit (mixed) — both work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
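
A sketch of the predicate-building logic, assuming mlx's `class_predicate` hook accepts a dict return value for per-layer overrides; the helper name and config shape are illustrative:

```python
def build_class_predicate(overrides: dict):
    """Build a class_predicate for nn.quantize() from per-layer config overrides.

    overrides maps path suffixes to quantization params,
    e.g. {"layers.0.mlp": {"bits": 8, "group_size": 32}}.
    """

    def class_predicate(path: str, module):
        if not hasattr(module, "to_quantized"):
            return False  # layer cannot be quantized at all
        for suffix, params in overrides.items():
            # Suffix match, not substring: prevents "layers.1" matching "layers.10".
            if path.endswith(suffix):
                return params  # dict return overrides the default bits/group_size
        return True  # quantize with the model-wide defaults
    return class_predicate
```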

* fix: strip Gemma 4 thinking tags and turn markers from output

- Add <|channel>thought...<channel|> to THINK_PATTERN for non-streaming
- Add <|turn>, <turn|> to special token filter
- Non-streaming Gemma 4 output now clean (thinking stripped)
- Streaming still leaks thinking (needs gemma4 reasoning parser — TODO)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: mixed quant path matching + filter override keys

- Fix no-op ternary in override key processing
- Use path.endswith(suffix) instead of substring match to prevent
  false positives (e.g., layers.0 matching layers.10)
- Filter override config to only bits/group_size/mode keys

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add Gemma 4 benchmarks to README

Day-0 Gemma 4 support on Rapid-MLX:
- Gemma 4 26B-A4B MoE: 71 tok/s, 100% tools, 16 GB RAM
- Gemma 4 31B dense: 32 tok/s, 100% tools, 17 GB RAM
- 5.2x faster than mlx-vlm (LLM path with prompt cache)
- Custom gemma4 tool call parser (18th parser format)
- TTFT: 0.24s cached (vs 0.87s mlx-vlm)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: Gemma 4 reasoning parser — streaming thought/content separation

Custom parser for Gemma 4's channel-based thinking format:
  <|channel>thought\n...reasoning...<channel|>
  <|channel>content\n...answer...<channel|>

Streaming: thinking goes to 'reasoning' field, answer to 'content'.
Non-streaming: strip_thinking_tags removes thought blocks.
No more thinking leakage in client output.

Closes #62.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: token-level OutputRouter — config-driven output channel routing

New architecture for separating thinking/content/tool_calls:
- Token-ID based state machine (no regex, no text matching)
- Config-driven: reads special token IDs from tokenizer vocabulary
- Auto-detects Gemma 4 format from tokenizer vocab
- Single unified interface for both streaming and non-streaming
- 17 unit tests covering all routing scenarios

Design: OutputRouter.from_tokenizer(tokenizer) → state machine that
routes each token to CONTENT/REASONING/TOOL_CALL/CONTROL channels.

Currently implements Gemma 4 (channel tokens + tool_call tokens).
Future: migrate Qwen3 (<think>), DeepSeek, etc. to same architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
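
The state machine above can be sketched as follows, with hypothetical token IDs standing in for the values read from the tokenizer vocabulary:

```python
from enum import Enum, auto


class Channel(Enum):
    CONTENT = auto()
    REASONING = auto()
    TOOL_CALL = auto()
    CONTROL = auto()  # marker tokens, suppressed from output


class OutputRouter:
    """Route each token ID to a channel — no regex, no text matching."""

    def __init__(self, channel_open: int, channel_close: int, thought_marker: int):
        # In the real design these IDs come from OutputRouter.from_tokenizer().
        self.channel_open = channel_open
        self.channel_close = channel_close
        self.thought_marker = thought_marker
        self.state = Channel.CONTENT
        self._expect_label = False

    def feed(self, token_id: int) -> Channel:
        if token_id == self.channel_open:
            self._expect_label = True   # next token names the channel
            return Channel.CONTROL      # suppress the marker itself
        if self._expect_label:
            self._expect_label = False
            self.state = (Channel.REASONING if token_id == self.thought_marker
                          else Channel.CONTENT)
            return Channel.CONTROL      # suppress the label token too
        if token_id == self.channel_close:
            self.state = Channel.CONTENT
            return Channel.CONTROL
        return self.state               # ordinary token: route to current channel
```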

* feat: integrate OutputRouter into streaming pipeline

Wire token-level OutputRouter into the full streaming path:
- MLXLanguageModel.stream_generate() routes each token via router
- StreamingOutput gains 'channel' field ("content"/"reasoning"/"tool_call")
- GenerationOutput passes channel through SimpleEngine to server
- Server's stream_chat_completion uses channel for direct routing,
  bypassing regex-based reasoning parser for router-enabled models

Gemma 4 streaming now uses token-level routing:
- Zero thinking leakage (verified: 4/4 integration tests)
- Content/reasoning cleanly separated
- Tool calls properly accumulated and emitted
- Old regex parsers remain as fallback for non-router models

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: 6 issues from self-review on OutputRouter integration

From 3-round review:
- Init _output_router=None in __init__ (AttributeError risk)
- Move token_count++ after router suppress check (inflated count)
- Use vocab.get() in from_tokenizer (KeyError on partial vocab)
- Lazy decode: skip tokenizer.decode() for suppressed control tokens
- Add try/except around router.feed() with fallback to decoder
- Remove dead control_ids property

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update Gemma 4 benchmarks — full lineup with OutputRouter

Final Gemma 4 benchmark results (Mac Studio M3 Ultra 256GB):

| Model              | Decode    | TTFT cached | Tools | RAM     |
|---------------------|-----------|-------------|-------|---------|
| Gemma 4 26B-A4B 4b | 93.5 t/s  | 252ms       | 100%  | 14.4 GB |
| Gemma 4 E4B 4bit   | 82.8 t/s  | 253ms       | 100%  | 6.4 GB  |
| Gemma 4 31B 4bit   | 30.9 t/s  | 339ms       | 100%  | 17.0 GB |
| Gemma 4 31B bf16   | 10.9 t/s  | 574ms       | 100%  | 58.1 GB |

All models: 100% tool calling, 100% recovery, 0% leak.
Agent integration: 9/10 (OpenAI, LangChain, OpenCode, Aider).
Token-level OutputRouter: zero thinking leakage in streaming.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* release: v0.4.0 — Gemma 4 day-0 support + token-level OutputRouter

Major release highlights:
- Day-0 Gemma 4 full lineup: E4B (83 tok/s), 26B-A4B (94 tok/s), 31B (31 tok/s)
- Token-level OutputRouter: config-driven channel routing, zero regex
- Gemma 4 tool call parser (18th format) + reasoning parser
- 5.2x faster than mlx-vlm MLLM path (LLM path + prompt cache)
- 100% tool calling, 0% thinking leakage across all Gemma 4 models
- Upstream sync: 43 commits from waybarrios/vllm-mlx (SpecPrefill, streaming
  filters, Anthropic think blocks, detokenizer, v0.2.7)
- Global exception handler for production resilience
- mlx-vlm >= 0.4.4, mlx-lm >= 0.31.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: suppress orphan tool/response tokens in OutputRouter

Gemma 4 models can emit <tool_call|>, <tool_response|>, <|tool>, <tool|>
without matching opening tags during multi-round degradation. These
leaked into client output.

Now suppressed at token level:
- Orphan <tool_call|> outside TOOL_CALL state
- <|tool_response>, <tool_response|>, <|tool>, <tool|> always suppressed

Added 4 tests for orphan token handling. Total: 21 router tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: recover text-format tool calls from degraded Gemma 4 output

Gemma 4 models degrade to [Calling tool: name({...})] format after
multiple tool rounds at low quantization. The gemma4 tool parser now
catches this pattern and converts it to structured tool_calls.

Also triggers tool markup detection on '[' character (not just '<')
so the streaming tool parser path activates for text-format calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: final sanitizer — last-mile catch-all against markup leakage

Defense-in-depth: sanitize_output() runs on every content delta
before reaching the client. Catches ANY remaining markup:
- <|..> and <..|> asymmetric tokens (Gemma 4)
- <|..|> symmetric tokens (Qwen, GPT-OSS)
- [Calling tool:...] text-format degradation
- Stray </think>, </tool_call> closing tags

Applied in _fast_sse_chunk (hot path) and Pydantic chunk path.
Better to over-strip than to leak.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
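
A sketch of the catch-all, with patterns mirroring the four cases listed above (the actual pattern set may be broader):

```python
import re

# Order matters: symmetric <|...|> must be stripped before the asymmetric forms.
_MARKUP_PATTERNS = [
    re.compile(r"<\|[^<>|]*\|>"),            # symmetric <|..|> (Qwen, GPT-OSS)
    re.compile(r"<\|[^<>|]*>"),              # asymmetric <|..> (Gemma 4)
    re.compile(r"<[^<>|]*\|>"),              # asymmetric <..|> (Gemma 4)
    re.compile(r"\[Calling tool:[^\]]*\]"),  # text-format degradation
    re.compile(r"</(?:think|tool_call)>"),   # stray closing tags
]


def sanitize_output(text: str) -> str:
    """Last-mile scrub of any markup that survived the upstream parsers."""
    for pattern in _MARKUP_PATTERNS:
        text = pattern.sub("", text)
    return text
```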

* fix: 3 bugs from self-review on final sanitizer

1. Third streaming path missing '[' check in tool_markup_possible
2. Third streaming path Pydantic fallback missing sanitize_output()
3. Text-format tool call recovery had no deduplication (re-emitted
   same tool call on every subsequent delta)

Also: guard empty SSE chunks from _fast_sse_chunk when sanitizer
strips all content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* bench: final Gemma 4 verification — 0% leak, 6/6 agent tests, all models

Post-sanitizer benchmark + agent integration (M3 Ultra 256GB):

| Model              | Decode   | Tools | Leak | Agent |
|---------------------|----------|-------|------|-------|
| 26B-A4B MoE 4bit   | 93.5 t/s | 100%  | 0%   | 6/6   |
| 31B dense 4bit     | 31.0 t/s | 100%  | 0%   | 6/6   |
| E4B 4bit           | 82.2 t/s | 100%  | 0%   | 6/6   |

Agent tests: OpenAI SDK, streaming, tools, streaming tools,
LangChain bind_tools, multi-turn coding. All passed.

1989 unit tests + 21 OutputRouter tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jan Hilgard <jan.hilgard@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: otarkhan <osama.taha1994@gmail.com>
Co-authored-by: OpenClaw <openclaw@example.com>
Co-authored-by: kol22 <kol.prue@gmail.com>
Co-authored-by: NeoMody <neomody77@gmail.com>
Co-authored-by: hkstrongside <hkstrongside@users.noreply.github.com>
Co-authored-by: Kolden Prue <74475667+kol22@users.noreply.github.com>
Co-authored-by: dan cooper <dan.cooper@berkeley.edu>
Co-authored-by: patanet7 <patanet71@gmail.com>
Co-authored-by: Wayner Barrios <waybarrios@gmail.com>
Co-authored-by: Thump604 <thump@cosmiccooler.org>
Co-authored-by: BelieveDiffusion <believediffusion@outlook.com>
Co-authored-by: Christopher Albert <albert@tugraz.at>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Brett Thompson <brett@Bretts-Mac-mini.local>
Co-authored-by: Stuart Swerdloff <sjswerdloff@gmail.com>
Co-authored-by: clement-7074f29f <clement-7074f29f@sjstargetedsolutions.co.nz>
@raullenchai (Owner) left a comment:


LGTM — clean implementation covering both entry points with tests. Minor nit: normalize_log_level() could be inlined, but not blocking. Thanks @XiaoPengMei!

@raullenchai raullenchai merged commit 2980450 into raullenchai:main Apr 11, 2026

Linked issue: Add --log-level CLI flag

2 participants